class: center, middle, inverse, title-slide

# Spatial Data Workshop: Working with Census Data and Introduction to Map Making
### Josemari Feliciano
### USDA Rural Development - Innovation Center<br>Department of Biostatistics, Yale University
### 04/06/2022

---

<style type="text/css">
body { font-size: 12px; }
td { font-size: 12px; }
code.r{ font-size: 12px; }
.remark-code, .remark-inline-code { font-size: 90%; }
</style>

# Workshop Goals

1. Introduce you to various geographic boundaries (e.g., counties, tracts, block groups) we work with here in the US.
2. Learn how to properly define rural across various geographic boundaries.
3. Introduce the Census geocoding tools through (a) their web interface and (b) the censusxy package.
4. Provide an overview of various datasets offered by the US Census Bureau.
5. Provide a detailed introduction to the American Community Survey (ACS) data.
6. Learn how to use R packages such as censusapi to seamlessly download and work with ACS data.
7. Learn the basics of static map making using ggplot2 and sf.
8. Time permitting: Learn the basics of interactive map making using leaflet.

---

# Before we go further:

Please sign up for a Census API key: https://api.census.gov/data/key_signup.html

We will be using the following R packages later:

---

## Geographic Identifiers (GEOIDs): The Basics.

__Geographic identifiers (or GEOIDs)__ are numeric codes that uniquely identify all administrative/legal and statistical geographic areas.

- Without a common identifier shared by geographic and demographic datasets, researchers and other stakeholders would have a difficult time pairing the appropriate demographic data with the appropriate geographic data. This would considerably increase data processing times and the likelihood of data inaccuracy.

Here in the US, we _primarily_ use what are called __Federal Information Processing Series (FIPS) codes__.
- Many US-based datasets label the relevant code in their geographic and demographic files as either GEOID or FIPS. Datasets use GEOID and FIPS interchangeably.
- If you are working with spatial data, it is best to have the FIPS code so you can easily merge datasets.

---

### Geographic hierarchies

<img src="data:image/png;base64,#geography_level.PNG" width="60%" style="display: block; margin: auto;" />

Typically, the notable geographic levels that scientists and policy makers concern themselves with are: (1) State, (2) County, (3) Census Tract, (4) Census Block, and (5) ZIP Code Tabulation Areas (ZCTAs).

---

## Geographic hierarchies

<img src="data:image/png;base64,#geographic.png" width="80%" style="display: block; margin: auto;" />

---

## Federal Information Processing Standards (FIPS)

<img src="data:image/png;base64,#GEOIDStructure.PNG" width="55%" style="display: block; margin: auto;" />

Again, GEOID/FIPS codes are typically what we use to identify both the geographic level and the specific location we are working with. FIPS and GEOID are often used synonymously.

---

## State FIPS

You can get the list of state FIPS codes from many websites. This specific list is from [Census](https://www2.census.gov/geo/docs/reference/state.txt).
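---

## GEOIDs in Practice: A Small Sketch

Because GEOIDs are nested (2 digits for state, then 3 for county, then 6 for tract), you can recover the higher-level codes from a longer GEOID by position. Below is a minimal sketch of that idea. The helper name `split_geoid` and the example codes are my own illustration and not part of any package. Note that GEOIDs should be stored as text so leading zeros are not dropped.

```r
# Split GEOIDs into state, county, and tract pieces by character position.
# split_geoid() is a hypothetical helper written for this slide.
split_geoid <- function(geoid) {
  data.frame(
    geoid  = geoid,
    state  = substr(geoid, 1, 2),                                        # digits 1-2
    county = ifelse(nchar(geoid) >= 5, substr(geoid, 3, 5), NA_character_),   # digits 3-5
    tract  = ifelse(nchar(geoid) >= 11, substr(geoid, 6, 11), NA_character_), # digits 6-11
    stringsAsFactors = FALSE
  )
}

# Illustrative GEOIDs at the state, county, and tract levels
example_geoids <- c("09", "09009", "09009140100")
parsed <- split_geoid(example_geoids)
print(parsed)
```

The same positional logic is what makes merging work: two datasets can be joined as long as both carry the GEOID at the same geographic level and in the same text format.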
---

## A Quick Detour: Non-Census Data at County and Tract Level

Federal agencies and researchers are increasingly using the CDC/ATSDR __Social Vulnerability Index (SVI)__.

__From CDC:__ "Natural disasters and infectious disease outbreaks can pose a threat to a community’s health. Socially vulnerable populations are especially at risk during public health emergencies because of factors like socioeconomic status, household composition, minority status, or housing type and transportation."

__SVI Availability:__ Data are available at the county and tract levels.

__Index Range:__ The index (labelled RPL_THEMES in the dataset) is a score between 0 (least vulnerable) and 1 (most vulnerable).

---

## SVI Components

The latest SVI data is for 2018. The dataset calculates the SVI using Census ACS data (more on this later) for 2014-2018.

Click [here](https://www.atsdr.cdc.gov/placeandhealth/svi/documentation/SVI_documentation_2018.html) for detailed SVI 2018 documentation.

<img src="data:image/png;base64,#CDC-SVI-Variables.jpg" width="60%" style="display: block; margin: auto;" />

---

### Partial SVI Data at County-Level:
--- ### Partial SVI Data at Tract-Level:
---

## How do we define rural geographically in the US?

Major definitions we can use:

- __Office of Management and Budget (OMB) Definition:__ County-level.
- __USDA Rural-Urban Continuum Codes (RUCC):__ County-level.
- __USDA Rural-Urban Commuting Area (RUCA):__ Tract-level.

There are others that are somewhat difficult to work with (Census, Frontier and Remote Area Codes).

Note: RUCA is also available at the ZIP code level, but I personally do not recommend it because ZIP codes can change drastically across years. The distinction between postal ZIP codes and ZIP Code Tabulation Areas (ZCTAs) can also be confusing (more on this later).

---

## Rural Definition: OMB

The OMB definition deals with another geographic level (core-based statistical areas). These statistical areas are composed of one or more counties.

<img src="data:image/png;base64,#omb_map.gif" width="90%" style="display: block; margin: auto;" />

---

## Rural Definition: OMB

Under the OMB approach, rural is defined as any area that is not part of a metropolitan area (i.e., micropolitan areas and non-core areas).

The March 2020 core-based statistical area data below is from [this Census dataset](https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html).
---

## Rural Definition: OMB

The OMB dataset is not very user-friendly or "clean". Here is my processed, cleaned list of rural/non-rural counties:
---

## Rural Definition: Rural-Urban Continuum Codes (RUCC)

- Dataset created by the USDA Economic Research Service (ERS).
- Goes beyond a binary classification: rural (non-metro) vs non-rural (metro).
- Latest dataset is from 2013. Next data iteration should be out in 2023.
- RUCC codes distinguish metropolitan (metro) counties by the population size of their metro area, and nonmetropolitan (nonmetro) counties by degree of urbanization and adjacency to metro areas.
- Full documentation and dataset can be accessed by clicking [here](https://www.ers.usda.gov/data-products/rural-urban-continuum-codes/).

---

## Rural Definition: Rural-Urban Continuum Codes (RUCC)

Each county is assigned a RUCC between 1 and 9.

RUCC Codes for Metro (non-rural) counties:

- 1: Counties in metro areas of 1 million population or more
- 2: Counties in metro areas of 250,000 to 1 million population
- 3: Counties in metro areas of fewer than 250,000 population

RUCC Codes for Non-metro (rural) counties:

- 4: Urban population of 20,000 or more, adjacent to a metro area
- 5: Urban population of 20,000 or more, not adjacent to a metro area
- 6: Urban population of 2,500 to 19,999, adjacent to a metro area
- 7: Urban population of 2,500 to 19,999, not adjacent to a metro area
- 8: Completely rural or less than 2,500 urban population, adjacent to a metro area
- 9: Completely rural or less than 2,500 urban population, not adjacent to a metro area

---

## Full RUCC Dataset
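---

## RUCC: Collapsing Codes into Rural vs Non-rural

If you only need the binary split described earlier (codes 1-3 are metro, codes 4-9 are non-metro), a one-line recode is enough. This is a minimal sketch on a toy table: the FIPS values and RUCC assignments are made up, and the column names `FIPS` and `RUCC` are placeholders that may not match the headers in the ERS file.

```r
# Toy RUCC table; FIPS codes and RUCC values are made up for illustration
rucc_data <- data.frame(
  FIPS = c("00001", "00002", "00003", "00004"),
  RUCC = c(1, 3, 4, 9),
  stringsAsFactors = FALSE
)

# Codes 1-3 are metro (non-rural); codes 4-9 are non-metro (rural)
rucc_data$rural <- ifelse(rucc_data$RUCC >= 4, "Rural", "Non-rural")

print(rucc_data)
```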
---

## Rural Definition: Rural-Urban Commuting Area (RUCA)

- Dataset created by the USDA Economic Research Service (ERS).
- Dataset is from 2010.
- RUCA classification is based on population density, urbanization, and daily commuting to identify urban cores and adjacent territory at the tract level.
- Data are available at the (1) tract and (2) ZIP code levels.
- Very technical methodology. For data and documentation, click [here](https://www.ers.usda.gov/data-products/rural-urban-commuting-area-codes/) if you want to learn more.

---

## Rural Definition: Rural-Urban Commuting Area (RUCA)

Each census tract is assigned a Primary RUCA code between 1 and 10 (99 if not coded).

Primary RUCA Codes:

- 1 - Metropolitan area core: primary flow within an urbanized area (UA)
- 2 - Metropolitan area high commuting: primary flow 30% or more to a UA
- 3 - Metropolitan area low commuting: primary flow 10% to 30% to a UA
- 4 - Micropolitan area core: primary flow within an Urban Cluster of 10,000 to 49,999 (large UC)
- 5 - Micropolitan high commuting: primary flow 30% or more to a large UC
- 6 - Micropolitan low commuting: primary flow 10% to 30% to a large UC
- 7 - Small town core: primary flow within an Urban Cluster of 2,500 to 9,999 (small UC)
- 8 - Small town high commuting: primary flow 30% or more to a small UC
- 9 - Small town low commuting: primary flow 10% to 30% to a small UC
- 10 - Rural areas: primary flow to a tract outside a UA or UC
- 99 - Not coded: Census tract has zero population and no rural-urban identifier information

---

## Sample RUCA Tract Data

Only sample data (N=10) is shown here as there are 70,000+ tracts. The full RUCA code dataset and documentation can be accessed [here](https://www.ers.usda.gov/data-products/rural-urban-commuting-area-codes/).
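---

## RUCA: Grouping Primary Codes

The primary RUCA codes listed earlier fall into four broad groups plus "not coded". Here is a minimal sketch of labelling them. The helper name `label_ruca` is hypothetical, and which of these groups you treat as "rural" is an analysis decision this sketch does not make for you.

```r
# Map primary RUCA codes to the broad groups from the code list.
# label_ruca() is a hypothetical helper written for this slide.
label_ruca <- function(code) {
  ifelse(code %in% 1:3, "Metropolitan",
  ifelse(code %in% 4:6, "Micropolitan",
  ifelse(code %in% 7:9, "Small town",
  ifelse(code %in% 10,  "Rural", "Not coded"))))
}

# Illustrative primary codes, one from each group
ruca_groups <- label_ruca(c(1, 5, 8, 10, 99))
print(ruca_groups)
```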
---

## Free geocoding tools

You likely know which county you are in. However, you likely do not know the specific census tract or block group you are in.

Luckily, Census has free geocoding tools (e.g., web interface, API) that can help us!

The censusxy package in R is a wrapper around the Census Geocoding API and provides an easy way to geocode your data.

Let us go over basic geocoding examples using the censusxy package.

---

## The censusxy package: retrieving longitude and latitude.

__Code for finding coordinates for an address:__

```r
library(censusxy)

cxy_single('1600 Pennsylvania Avenue NW', 'Washington', 'DC', 20500)
# cxy_single('1600 Pennsylvania Avenue NW', 'Washington', 'DC') will also work!
```

</br>
__Returned output by API call:__
</br>Here, coordinates.x is the __longitude__, while coordinates.y is the __latitude__.

---

## The censusxy package: retrieving geographies.

Earlier, we learned that the longitude and latitude for the White House (1600 PENNSYLVANIA AVE NW, WASHINGTON, DC, 20500) are -77.03534 and 38.89875, respectively.

__Code for finding geographies for coordinates:__

```r
cxy_geography(-77.03534, 38.89875)
```

</br>
__Note:__ The code above will return a data frame with 164 columns. The output below is a subset with the specific GEOID variables I wanted (run the code yourself to see all 164 columns).

__Partial Output:__
---

## List of Census Surveys and Datasets

</br></br></br>

.pull-left[
The US Census Bureau conducts 130+ surveys each year. A detailed list can be accessed by clicking [this](https://www.census.gov/programs-surveys/surveys-programs.html).

Let us quickly explore this list using a web browser.
]

.pull-right[
<img src="data:image/png;base64,#census_list.png" width="100%" style="display: block; margin: auto;" />
]

---

## Quick overview of Small Area Health Insurance Estimates (SAHIE)

.pull-left[
You may access the large yearly SAHIE datasets by clicking [this](https://www2.census.gov/programs-surveys/sahie/datasets/time-series/estimates-acs/).

__Recommended:__ Access the data through the SAHIE interactive tool, which can be reached by clicking [this](https://www.census.gov/data-tools/demo/sahie/#/). This greatly reduces data cleaning/subsetting tasks.
]

.pull-right[
<img src="data:image/png;base64,#sahie_tool.png" width="100%" style="display: block; margin: auto;" />
]

---

## Quick overview of Longitudinal Employer-Household Dynamics (LEHD) Datasets

- Job-to-Job Flows (J2J) for job mobility statistics.
- LEHD Origin-Destination Employment Statistics (LODES) for employment data based on locations (residence, workplace). __I have worked with this extensively, so some examples are on the next slide.__
- Post-Secondary Employment Outcomes (PSEO) for statistics on the earnings and employment outcomes of graduates of select post-secondary institutions in the United States.
- Quarterly Workforce Indicators (QWI) for economic indicators including employment, job creation, earnings, and other measures of employment flows.
- Documentation for specific LEHD datasets can be accessed [here](https://lehd.ces.census.gov/data/).

---

## Quick demo of the LEHD LODES datasets

- LODES data is available at the census block level.
- In the dataset that I will demonstrate, I tallied the employment data at the census tract level.
- I will quickly show you, via an Excel workbook, what clean LODES datasets at the census tract level look like.

---

## The American Community Survey (ACS) Data

This link provides all the available tables and variables for the 2019 5-year ACS data: https://api.census.gov/data/2019/acs/acs5/variables.html

__WARNING:__ This documentation is HUGE and might take a minute to fully load.

Normally, ACS documentation is all over the place. Even some excellent resources on ACS data and its variables tend to have sparse documentation (you have to click through multiple links). So the link I provided is my own personal tip!

Variables may be added and removed across the years. So if you want to check the variable documentation for an older year, simply modify the '2019' in the link I just provided.

Let us spend a minute or two to see which variables may be included. The variables included might surprise you. Let us look up 'poverty', 'internet' and 'computer' as examples to gauge the comprehensiveness of the ACS data.

---

### Research scenario and the censusapi package

.pull-left[
We are researchers trying to understand the role of __household internet access__ in influencing health insurance coverage at the county level.

As an initial try, let us get the number of households in each county across the entire US.

_Partial data is displayed below_
]

.pull-right[
Initial setup:

```r
# If you don't have this package,
# run: install.packages("censusapi")
library(censusapi)

Sys.setenv(CENSUS_KEY = "YOUR API KEY HERE")

household_data <- getCensus(
  name = "acs/acs5",        # requests ACS5 data
  vintage = 2019,           # requests 2019 data
  vars = c("B28002_001E"),  # requests variable(s)
  region = "county:*")      # requests geography
```
]
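---

### From API output to a mergeable GEOID

County-level output from `getCensus()` carries the state and county codes in separate `state` and `county` columns, so a 5-digit county GEOID has to be pasted together before merging with other county-level data. Below is a minimal sketch on a hand-made data frame that imitates that output so it runs without an API key. The household counts are made up, and `household_example` is a name I chose for illustration.

```r
# Stand-in for getCensus() county output; the counts are made up
household_example <- data.frame(
  state = c("09", "09"),
  county = c("001", "009"),
  B28002_001E = c(1000, 2000),
  stringsAsFactors = FALSE
)

# 2-digit state code + 3-digit county code = 5-digit county GEOID
household_example$GEOID <- paste0(household_example$state,
                                  household_example$county)

print(household_example)
```

Keeping both pieces as text is what preserves the leading zero in codes such as Connecticut's 09.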
---

## Advanced Census API calls

An example of asking for multiple variables:

```r
data2019 <- getCensus(name = "acs/acs5",
                      vintage = 2019,
                      vars = c("B28002_001E", "B28002_002E"),
                      region = "county:*")
```

An example of asking for Connecticut-only county-level data (Note: CT's FIPS code is 09):

```r
data2 <- getCensus(name = "acs/acs5",
                   vintage = 2019,
                   vars = c("B28002_001E", "B28002_002E"),
                   region = "county:*",
                   regionin = "state:09")
```

---

## Advanced Census API calls

An example of asking for multiple variables at the Census Tract level:

```r
data2019 <- getCensus(name = "acs/acs5",
                      vintage = 2019,
                      vars = c("B28002_001E", "B28002_002E"),
                      region = "tract:*",
                      regionin = "state:09")
```

Requesting tract-level data requires a specific state input, as demonstrated above. Here's a simple workaround I came up with to avoid having to write 50+ function calls:

```r
# ACS 5-year data cover the 50 states, DC (11), and Puerto Rico (72);
# codes for the other territories are left out because the ACS has no data for them
state_fips <- c('01','02','04','05','06','08','09','10','11','12','13','15','16',
                '17','18','19','20','21','22','23','24','25','26','27','28','29',
                '30','31','32','33','34','35','36','37','38','39','40','41','42',
                '44','45','46','47','48','49','50','51','53','54','55','56','72')

# The first request (state 01) creates the data frame the loop appends to
data2019 <- getCensus(name = "acs/acs5",
                      vintage = 2019,
                      vars = c("B28002_001E", "B28002_002E"),
                      region = "tract:*",
                      regionin = "state:01")

for (i in 2:length(state_fips)) {
  new_data <- getCensus(name = "acs/acs5",
                        vintage = 2019,
                        vars = c("B28002_001E", "B28002_002E"),
                        region = "tract:*",
                        regionin = paste0("state:", state_fips[i]))
  data2019 <- rbind(data2019, new_data)
  # Sys.sleep(2) tells R to pause for 2 seconds between requests
  # so the server does not misclassify them as a DDoS attack
  Sys.sleep(2)
}
```

---

## Making Maps the Simple Way

There are plenty of map-related resources online. They are often confusing or convoluted, I contend, because they assume a fundamental understanding of ggplot or the tidyverse as a whole that many readers lack.
I want to give you a simple but effective template you can reference should you find a need to make a county-level map.

The data we will plot today is from the CDC's COVID-19 Integrated County View dashboard as of 11/24/2021 (their latest data prior to this presentation). Click [here](https://covid.cdc.gov/covid-data-tracker/#county-view) to access their dashboard.

---

### County-level Vaccination Data (As of 11/24/21)
---

## Mapping Code

```r
library(tigris)
library(dplyr)
library(ggplot2)

# Downloads county-level Census shapefile
# Drops Alaska (02), Hawaii (15), and the territories to map the contiguous US
county_shape <- counties(cb = TRUE) %>%
  filter(!(STATEFP %in% c("02", "15", "60", "66", "69", "72", "78")))

# Reads the CDC Data
# FIPS is read as text so leading zeros are kept and it matches GEOID's type
vaccination_data <- read.csv("~/BIS679/DplyrCensus/vaccination_data.csv",
                             colClasses = c(FIPS = "character"))

# Combines the two data
# Uses county_shape's GEOID and vaccination_data's FIPS variables to link
combined_data <- left_join(county_shape, vaccination_data,
                           by = c("GEOID" = "FIPS"))

ggplot(data = combined_data) +
  geom_sf(aes(fill = Series_Complete_Pop_Pct), color = NA) +
  scale_fill_gradient(low = "#1fa187", high = "#440154") +
  labs(fill = 'Vaccinated\nPopulation (%)\n(As of 11/24/21)') +
  theme_void()
```

Output on the next slide

---

### Output of map-generating code

---

## Conclusion

Thank you for attending today's guest lecture.

Keep in touch. I mean this part!

__Email:__ For general questions, you may email me at `jfeliciano@alumni.harvard.edu` or `jfeliciano@aya.yale.edu`. For USDA-related questions on the work we do, you may email me at `josemari.feliciano@usda.gov`.

__Twitter:__ `@SeriFeliciano`

__LinkedIn:__ https://www.linkedin.com/in/jmtfeliciano/

__Bonus:__ My favorite penguin species is the Adelie. Click [here](https://www.youtube.com/watch?v=Z7PlUGbsXlQ) for a cute two-minute clip from BBC.